多标签遥感图像分类(MLRSIC)已获得越来越多的研究兴趣。将多个标签的辅助关系作为其他信息有助于提高此任务的性能。当前方法着重于使用它来限制卷积神经网络(CNN)的最终功能输出。一方面,这些方法不会充分利用标签相关来形成特征表示。另一方面,它们增加了系统的标签噪声灵敏度,导致稳健性差。在本文中,提出了一种称为语义交织的全球通道注意(Signa)的新颖方法。首先,根据数据集的统计信息获得标签共发生图。标签共发生图用作图形神经网络(GNN)的输入,以生成最佳特征表示。然后,语义特征和视觉特征交错,以指导图像从原始特征空间到具有嵌入式标签关系的语义特征空间的特征表达。 Signa在新的语义特征空间中触发了特征地图通道的全球关注,以提取更重要的视觉特征。提出了基于多头签名的功能自适应加权网络,以插件的方式对任何CNN作用。对于遥感图像,可以通过将CNN插入浅层层来实现更好的分类性能。我们对三个数据集进行了广泛的实验比较:UCM数据集,AID数据集和DFC15数据集。实验结果表明,与最新方法(SOTA)方法相比,所提出的Signa具有出色的分类性能。值得一提的是,本文的代码将向社区开放,以进行可重复性研究。我们的代码可在https://github.com/kyle-one/signa上找到。
translated by 谷歌翻译
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.
translated by 谷歌翻译
Cooperative multi-agent reinforcement learning (c-MARL) is widely applied in safety-critical scenarios, thus the analysis of robustness for c-MARL models is profoundly important. However, robustness certification for c-MARLs has not yet been explored in the community. In this paper, we propose a novel certification method, which is the first work to leverage a scalable approach for c-MARLs to determine actions with guaranteed certified bounds. c-MARL certification poses two key challenges compared with single-agent systems: (i) the accumulated uncertainty as the number of agents increases; (ii) the potential lack of impact when changing the action of a single agent into a global team reward. These challenges prevent us from directly using existing algorithms. Hence, we employ the false discovery rate (FDR) controlling procedure considering the importance of each agent to certify per-state robustness and propose a tree-search-based algorithm to find a lower bound of the global reward under the minimal certified perturbation. As our method is general, it can also be applied in single-agent environments. We empirically show that our certification bounds are much tighter than state-of-the-art RL certification solutions. We also run experiments on two popular c-MARL algorithms: QMIX and VDN, in two different environments, with two and four agents. The experimental results show that our method produces meaningful guaranteed robustness for all models and environments. Our tool CertifyCMARL is available at https://github.com/TrustAI/CertifyCMA
translated by 谷歌翻译
We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a framework for training rerankers based on a hybrid of BM25 and neural retrieval models. Retrievers based on hybrid models have been shown to outperform both BM25 and neural models alone. Our approach exploits this improved performance when training a reranker, leading to a robust reranking model. The reranker, a cross-attention neural model, is shown to be robust to different first-stage retrieval systems, achieving better performance than rerankers simply trained upon the first-stage retrievers in the multi-stage systems. We present evaluations on a supervised passage retrieval task using MS MARCO and zero-shot retrieval tasks using BEIR. The empirical results show strong performance on both evaluations.
translated by 谷歌翻译
Drug-Drug Interactions (DDIs) prediction is an essential issue in the molecular field. Traditional methods of observing DDIs in medical experiments require plenty of resources and labor. In this paper, we present a computational model dubbed MedKGQA based on Graph Neural Networks to automatically predict the DDIs after reading multiple medical documents in the form of multi-hop machine reading comprehension. We introduced a knowledge fusion system to obtain the complete nature of drugs and proteins and exploited a graph reasoning system to infer the drugs and proteins contained in the documents. Our model significantly improves the performance compared to previous state-of-the-art models on the QANGAROO MedHop dataset, which obtained a 4.5% improvement in terms of DDIs prediction accuracy.
translated by 谷歌翻译
Evaluating automatically-generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be subsequently utilized for evaluating a generated summary given an input document, without gold reference summaries. RISE is especially well suited when working on new datasets where one may not have reference summaries available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data-efficiency and generalizability across languages.
translated by 谷歌翻译
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to enhance the prototypes with multimodal information adequately. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN) to use the semantic information of label texts as multimodal information to enhance prototypes, including two modality flows. A CLIP visual encoder is introduced in the visual flow, and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. A frozen CLIP text encoder is introduced in the text flow, and a semantic-enhanced module is used to enhance text features. After inflating, text prototypes are obtained. The final multimodal prototypes are then computed by a multimodal prototype-enhanced module. Besides, there exist no evaluation metrics to evaluate the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric called Prototype Similarity Difference (PRIDE), which is used to evaluate the performance of prototypes in discriminating different categories. We conduct extensive experiments on four popular datasets. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
translated by 谷歌翻译
Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from some limitations: i) it has no learnable parameters and it neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features in a large field which are important in an image; iii) it inherently has a gap with real camera imaging since it only depends on the coordinate. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate CiaoSR significantly outperforms the existing single image super resolution (SISR) methods with the same backbone. In addition, the proposed method also achieves the state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated on the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance.
translated by 谷歌翻译
Large deep learning models have achieved remarkable success in many scenarios. However, training large models is usually challenging, e.g., due to the high computational cost, the unstable and painfully slow optimization procedure, and the vulnerability to overfitting. To alleviate these problems, this work studies a divide-and-conquer strategy, i.e., dividing a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. This approach is promising since it avoids directly training large models from scratch. Nevertheless, implementing this idea is non-trivial, as it is difficult to ensure the compatibility of the independently trained modules. In this paper, we present an elegant solution to address this issue, i.e., we introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled together. We further propose a module incubation mechanism that enables the meta model to be designed as an extremely shallow network. As a result, the additional overhead introduced by the meta model is minimalized. Though conceptually simple, our method significantly outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency. For example, on top of ViT-Huge, it improves the accuracy by 2.7% compared to the E2E baseline on ImageNet-1K, while saving the training cost by 43% in the meantime. Code is available at https://github.com/LeapLabTHU/Model-Assembling.
translated by 谷歌翻译